VANet: a medical image fusion model based on attention mechanism to assist disease diagnosis

Background Today's biomedical imaging technology can present the morphological structure or functional metabolic information of organisms at different scales, such as organ, tissue, cell, molecule and gene. However, different imaging modes have different application scopes, advantages and disadvantages. To improve the role of medical images in disease diagnosis, the fusion of biomedical image information across imaging modes and scales has become an important research direction. Traditional medical image fusion methods are all designed around activity level measurement and fusion rules. They do not mine the contextual features of the different image modes, which limits the quality of the fused images.
Method In this paper, an attention-multiscale network medical image fusion model based on contextual features is proposed. The model selects five backbone modules of the VGG-16 network to build the encoder, which obtains the contextual features of medical images. It builds an attention mechanism branch to fuse global contextual features and designs a residual multiscale detail processing branch to fuse local contextual features. Finally, the decoder completes the cascade reconstruction of the features to obtain the fused image.
Results Ten sets of images related to five diseases are selected from the AANLIB database to validate the VANet model. Structural images are derived from high-resolution MR images, and functional images are derived from SPECT and PET images, which are good at describing organ blood flow levels and tissue metabolism. Fusion experiments are performed with twelve fusion algorithms, including the VANet model. Eight metrics covering different aspects are selected to build a fusion quality evaluation system for the fused images. Friedman's test and the post-hoc Nemenyi test are introduced as statistical tests to demonstrate the superiority of the VANet model.
Conclusions The VANet model completely captures and fuses the texture details and color information of the source images. In the fusion results, metabolic and structural information is well expressed, and the color information does not interfere with structure and texture. In terms of the objective evaluation system, the metric values of the VANet model are generally higher than those of the other methods. In terms of efficiency, the time consumption of the model is acceptable. In terms of scalability, the model is not affected by the input order of the source images and can be extended to tri-modal fusion.

Keywords: Medical image, Medical image fusion, Attention mechanism, Contextual information, Multi scale feature extraction

Background
As an important auxiliary tool for medical diagnosis, medical images play an essential role in clinical practice. With the development of sensor technology, the types of medical images are becoming more and more abundant [1,2]. The information provided to doctors by different types of medical images is usually complementary, and how to aggregate this complementary information into one image has become the focus of current research [3][4][5][6][7]. Figure 1 presents two modal images of a patient with mild Alzheimer's disease and their fusion result. Figure 1a is the MR-T2 image, which shows globally widened hemispheric sulci that are more prominent in the parietal lobes. Figure 1b is the PET image, which captures signals of markedly abnormal metabolism in brain regions. Weak metabolism occurs in the anterior temporal and posterior parietal regions. The changes tend to be bilateral, but the right hemisphere is more affected than the left, with the posterior cingulate gyrus relatively unaffected. Figure 1c is the fusion result of Fig. 1a and b. Doctors can pay attention to the metabolism of abnormal parts while observing structural changes, which shows that medical image fusion is of great significance to clinical diagnosis. Since the quality of the fused images directly affects the doctor's judgment of the disease, improving the fusion quality of medical images has become an urgent problem. The quality of fused images depends on the acquisition of image features and the design of fusion rules. Traditional methods usually rely on manually designed feature extraction methods and fusion rules. Although such methods can effectively describe the detailed features of images, they cannot acquire the features of images with different modalities. Human-designed image fusion rules focus more on computing weight maps, which integrate pixel activity information from different source images.
In traditional fusion methods, the weight map is computed in two steps: activity level measurement and weight assignment. Medical images are decomposed by pre-designed filters and their activity is measured by the absolute value of the decomposed coefficients. Then a "choose-max" or "weighted-average" fusion rule is applied to the different measurement sources to assign weights. However, this way of measuring activity and assigning weights is not very stable because of noise, registration errors and differences in pixel intensities. In order to further improve the performance of the fusion model, scholars have proposed many complex decomposition methods and carefully designed weight allocation strategies. However, these methods are usually designed step by step, which breaks the link between activity level measurement and weight assignment.
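The two traditional fusion rules mentioned above can be illustrated with a minimal sketch. It assumes that both source images have already been decomposed into coefficient maps of equal size; the function name `fuse_coefficients` and the toy coefficient values are hypothetical and are not taken from any of the cited methods.

```python
def fuse_coefficients(coeffs_a, coeffs_b, rule="choose-max"):
    """Fuse two equally sized 2-D coefficient maps element by element."""
    fused = []
    for row_a, row_b in zip(coeffs_a, coeffs_b):
        fused_row = []
        for a, b in zip(row_a, row_b):
            # Activity level is measured by the absolute coefficient value.
            act_a, act_b = abs(a), abs(b)
            if rule == "choose-max":
                # Keep the coefficient with the higher activity (ties go to a).
                fused_row.append(a if act_a >= act_b else b)
            elif rule == "weighted-average":
                # Weights are proportional to the activity levels.
                total = act_a + act_b
                w_a = 0.5 if total == 0 else act_a / total
                fused_row.append(w_a * a + (1.0 - w_a) * b)
            else:
                raise ValueError(f"unknown rule: {rule}")
        fused.append(fused_row)
    return fused


# Toy coefficient maps (illustrative values only).
coeffs_a = [[0.9, -0.1], [0.0, 0.4]]
coeffs_b = [[0.2, -0.7], [0.0, -0.4]]
fused_max = fuse_coefficients(coeffs_a, coeffs_b, "choose-max")
fused_avg = fuse_coefficients(coeffs_a, coeffs_b, "weighted-average")
```

Because each weight depends only on the local coefficient magnitude, a noisy or misregistered pixel changes the decision directly, which is the instability described in the text.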
Medical image fusion methods based on deep learning can consider the key issues of the fusion process jointly. This kind of method realizes a direct mapping from source image to weight by encoding the image, and it completes activity level measurement and weight assignment in an "optimal" way by learning network parameters, which effectively strengthens the correlation between activity level measurement and weight assignment. Among deep learning algorithms, improved algorithms based on autoencoders (AE) [8][9][10], generative adversarial networks (GAN) [11,12] and convolutional neural networks (CNN) [13][14][15] are popular in medical image fusion. Song et al. proposed MSDNet and applied it to the extraction of medical image features [16]. The multiplexing of features enhanced the expression of important information in the fused image. Kang et al. regarded the fusion of PET and MR images as a min-max optimization problem with respect to the generator and the discriminator [17]. They proposed the TAcGAN model to enhance the structural features of fused images through a game between generator and discriminator, while preserving part of the information of SPECT images. Zhang et al. proposed a general fusion framework based on convolutional neural networks called IFCNN [18]. IFCNN can obtain the salient features of medical images without being limited by the number of source images. The fused images better preserve important features from the different source images.
Although the above methods improve the fusion quality of medical images, their improvement is limited. This is because they only focus on image fusion itself and ignore the purpose of medical image fusion. Medical image fusion focuses on the global and local effects of abnormal tissue on medical images, which are often reflected in the contextual information of images. Therefore, how to obtain image context information has become the top priority of current research. In order to address this issue, we propose a new medical image fusion model based on deep learning, called VANet. The VANet model has two most important parts, the encoder and the fusion network. The encoder consists of the five convolutional pooling blocks of the VGG-16 network, which can sufficiently capture the contextual information of medical images. The fusion network combines residual multi-scale feature extraction with an attention mechanism to enhance salient features and preserve texture detail information.

Overview
VANet is a new type of medical image fusion model. It consists of three parts: encoder, AM fusion network and decoder. In Fig. 2, the encoder consists of five coding blocks, which correspond to the five blocks of VGG-16. The five feature maps obtained from the five blocks contain all the contextual semantic information of the image. The feature maps are then put into the AM fusion network for multi-scale deep feature fusion. The AM fusion network consists of the attention mechanism branch and the residual multi-scale detail fusion branch. The attention mechanism branch consists of the channel attention mechanism block and five convolution blocks. Among them, the channel attention mechanism block can suppress noise, especially in functional images. The residual multi-scale detail fusion branch includes three convolution blocks and a multi-scale detail fusion block. Among them, the multi-scale detail fusion block compensates for the loss of detail caused by the pooling operation in the attention mechanism branch. Finally, the fused feature map is input to the decoder to reconstruct the fused image.

Encoder
Traditional encoders tend to ignore the context information of feature maps in feature extraction. Facts have proved that the pathological characteristics of tissues are not only reflected in a certain independent part, but also in its contextual information. Therefore, we select the VGG-16 network that can obtain context information in the encoder.
As shown in Fig. 3, VGG-16 contains five blocks. Its biggest feature is that it can obtain information about the image context. The first two blocks each consist of two convolutional layers and one max-pooling layer. The last three blocks each consist of three convolutional layers and one max-pooling layer. Stacking convolutional and pooling layers in this way easily forms a deeper network structure that obtains more complete and deeper contextual information. The kernel size of all convolutional layers is 3 × 3 and the size of all max-pooling layers is 2 × 2. The first four blocks have different numbers of output channels; the fourth and fifth blocks have the same number of output channels.
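The block structure described above can be made concrete with a short sketch that tracks the size of the feature map after each block. The channel widths (64, 128, 256, 512, 512) are those of the standard VGG-16 configuration. The sketch assumes that the 3 × 3 convolutions use padding that preserves the spatial size, that each 2 × 2 max-pooling layer uses stride 2, and that each feature map is taken after the pooling layer of its block; these details are assumptions and are not stated in the text. The 84 × 84 input corresponds to the training patch size used later.

```python
# Standard VGG-16 configuration: (number of 3x3 conv layers, output channels).
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def encoder_feature_shapes(height, width):
    """Return (channels, height, width) of the feature map after each block."""
    shapes = []
    for _num_convs, channels in VGG16_BLOCKS:
        # Assumption: padded 3x3 convolutions keep height and width unchanged,
        # and the 2x2 max pooling with stride 2 halves them (rounding down).
        height, width = height // 2, width // 2
        shapes.append((channels, height, width))
    return shapes


shapes = encoder_feature_shapes(84, 84)
for index, shape in enumerate(shapes, start=1):
    print(f"block {index}: channels={shape[0]}, size={shape[1]}x{shape[2]}")
```

Under these assumptions the five feature maps shrink from 42 × 42 to 2 × 2 while the channel count grows, so the deeper blocks summarise progressively larger image regions.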

AM fusion network
The AM fusion network is the core part of the VANet model. The extraction of important features and their associated features, the suppression of noise and the preservation of texture details all rely on the fusion network. As shown in Fig. 4, the AM fusion network consists of the attention mechanism branch and the residual multi-scale detail processing branch.
Attention mechanism branch
The attention mechanism branch is composed of five convolutional blocks and a channel attention mechanism block. Each convolutional block is composed of a convolutional layer, a batch normalization layer and a ReLU activation function. The kernel of the convolution layer in all convolution blocks is 3 × 3. In the first convolution block, a pooling layer is added after the activation function to reduce the feature dimension. In the fourth convolution block, we add an upsampling layer before the convolution layer to restore the feature dimension. A channel attention mechanism block is added behind the second convolution block and its working principle is shown in Fig. 5.
Fig. 3 The structure of the encoder of the VANet model
Fig. 4 The structure of the AM fusion network
In Fig. 5, the size of the input feature map F is H × W × C , which is put into the max pooling layer and the average pooling layer to obtain two 1 × 1 × C feature maps. Then the two feature maps are fed into a two-layer shared neural network for feature extraction. The number of neurons in the first layer of the network is C/r and the ReLU function is selected as the activation function. The number of neurons in the second layer of the network is C. The element-wise operation is performed on the features obtained by the shared neural network and the final channel attention feature Mc is generated after the sigmoid activation operation.
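The channel attention computation described above can be sketched in plain Python. The sketch assumes that the unspecified element-wise operation is a summation of the two branch outputs (as in CBAM-style channel attention), that the shared network has no bias terms, and that the resulting attention vector rescales the input channels; the weights are random placeholders, and the names `channel_attention`, `w1` and `w2` are hypothetical.

```python
import math
import random


def channel_attention(feature, w1, w2):
    """feature: C x H x W nested lists; w1: (C/r) x C; w2: C x (C/r)."""
    # Global max pooling and average pooling reduce each channel to one value,
    # giving two descriptors of length C (the two 1 x 1 x C feature maps).
    max_desc = [max(max(row) for row in ch) for ch in feature]
    avg_desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature]

    def shared_mlp(desc):
        # First layer: C/r neurons with ReLU; second layer: C neurons.
        hidden = [max(0.0, sum(w * d for w, d in zip(row, desc))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    # Assumed element-wise operation: summation of the two branch outputs.
    merged = [m + a for m, a in zip(shared_mlp(max_desc), shared_mlp(avg_desc))]
    # Sigmoid maps each channel weight into (0, 1).
    return [1.0 / (1.0 + math.exp(-v)) for v in merged]


random.seed(0)
C, H, W, r = 8, 4, 4, 4  # illustrative sizes only
feature = [[[random.random() for _ in range(W)] for _ in range(H)]
           for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]

mc = channel_attention(feature, w1, w2)
# Rescale every channel of the input by its attention weight.
reweighted = [[[mc[c] * v for v in row] for row in feature[c]] for c in range(C)]
```

Channels whose weight is close to 0 are suppressed, which is one way the block can attenuate noisy channels.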
Residual multi-scale detail processing branch
After the image passes through the attention mechanism branch, detail information is lost, which affects the fusion result. In order to avoid this, the residual multi-scale detail processing branch is designed. It includes a set of residual convolution blocks, a multi-scale detail fusion block and a convolution block. Among them, the residual convolution block is designed to prevent gradient explosion. The convolution kernels of all convolution blocks are set to 3 × 3. In the multi-scale detail fusion block, we use three different convolution kernels, because different convolution kernels can fuse detail information at different scales. The selection of the convolution kernels is shown in Fig. 4. Among them, a filter with a 1 × 1 convolution kernel is used to process the information of different channels at the same location. Filters with 3 × 3 and 5 × 5 convolution kernels are used to process the information surrounding the same location. A filter with a larger convolution kernel is not used because of the computational complexity of the model: a large convolution kernel brings more computation and seriously affects the computational performance of the model.
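The computational argument against larger kernels can be checked with a short calculation. The number of weights of a convolution layer grows with the square of the kernel size, and so does the number of multiply-accumulate operations (MACs) at a fixed resolution. The channel count and feature map size below are hypothetical and are chosen only for illustration.

```python
def conv_cost(kernel, in_ch, out_ch, height, width):
    """Weights and multiply-accumulates of one stride-1, same-padded convolution."""
    params = kernel * kernel * in_ch * out_ch  # bias terms are ignored
    macs = params * height * width             # one kernel application per pixel
    return params, macs


# Hypothetical layer: 64 input and output channels on a 42 x 42 feature map.
costs = {k: conv_cost(k, 64, 64, 42, 42) for k in (1, 3, 5, 7)}
for k, (params, macs) in costs.items():
    print(f"{k}x{k}: {params} weights, {macs} MACs")
```

In this setting a 7 × 7 kernel needs 49/25 = 1.96 times the computation of a 5 × 5 kernel and 49 times that of a 1 × 1 kernel, which illustrates why the multi-scale block stops at 5 × 5.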

Decoder
The decoder is based on a nested connection architecture. Inspired by UNet++, we simplified its structure. As shown in Fig. 2, the decoder consists of ten convolutional blocks. Each convolution block is composed of two convolution layers with convolution kernel of 3 × 3. The cross-layer link connects the multi-scale depth features in the decoder. The output of the decoder is a reconstructed image fused with multi-scale features.

Loss function
In order to improve the fusion effect of the VANet model, we use the structural similarity (SSIM) loss function, the mean squared error (MSE) loss function and the total variation (TV) loss function to form a hybrid loss function, in which α and β are the balance parameters. The SSIM loss function measures the loss of texture details of the source images during the fusion process. The MSE loss function measures the pixel-to-pixel loss between the fused image and the source images. The TV loss function is introduced to maintain the smoothness of the image and suppress noise. In the SSIM loss function, $I_{fused}$ represents the fused image, $I_{source}$ represents the source images and N is the batch size. SSIM(·) calculates the structural similarity between images. The closer the SSIM value is to 1, the more detailed information of the source image is contained in the fused image. In the MSE loss function, W and H are the width and height of the image, respectively, and (x, y) is the pixel position in the image.
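A hybrid loss of this kind can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the text: α weights the SSIM term and β weights the MSE term while the TV term has unit weight, the SSIM and MSE terms are averaged over the source images, SSIM is computed over a single global window rather than local windows, and the TV term is the anisotropic variant. All function names are hypothetical, and images are nested lists with values in [0, 1].

```python
def mse_loss(fused, source):
    """Mean squared pixel difference over an H x W image."""
    h, w = len(fused), len(fused[0])
    total = sum((fused[y][x] - source[y][x]) ** 2
                for y in range(h) for x in range(w))
    return total / (h * w)


def tv_loss(img):
    """Anisotropic total variation: summed neighbour differences / pixel count."""
    h, w = len(img), len(img[0])
    dv = sum(abs(img[y + 1][x] - img[y][x]) for y in range(h - 1) for x in range(w))
    dh = sum(abs(img[y][x + 1] - img[y][x]) for y in range(h) for x in range(w - 1))
    return (dv + dh) / (h * w)


def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM (simplification of the usual windowed SSIM)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((v - mx) * (v - mx) for v in xs) / n
    vy = sum((v - my) * (v - my) for v in ys) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(xs, ys)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))


def hybrid_loss(fused, sources, alpha=0.005, beta=0.003):
    """Assumed form: alpha * L_SSIM + beta * L_MSE + L_TV."""
    l_ssim = sum(1.0 - global_ssim(fused, s) for s in sources) / len(sources)
    l_mse = sum(mse_loss(fused, s) for s in sources) / len(sources)
    return alpha * l_ssim + beta * l_mse + tv_loss(fused)


# Toy 2 x 2 image compared with two identical copies of itself.
img = [[0.0, 0.5], [0.5, 1.0]]
loss_identical = hybrid_loss(img, [img, img])
```

When the fused image equals both sources, the SSIM and MSE terms vanish and only the TV term of the image itself remains, which shows that the TV term acts as a smoothness regularizer rather than as a fidelity term.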

Dataset and Experimental environment
The experimental data in this article are selected from the AANLIB database. 100 pairs of cross-modally registered medical images of brain abnormalities are downloaded and cropped into 11960 patch pairs as the training set for the VANet model. The size of each patch is set to 84 × 84. This operation not only ensures the diversity of the training data, but also enhances the robustness of VANet. As for the test data, we randomly selected two sets of images from each of the five diseases to test VANet. The training and testing of the VANet model are performed on a machine equipped with a 2.4 GHz Intel Core i7-11800H CPU (32 GB RAM) and a GeForce RTX 3070 GPU.

Comparison algorithm and metrics
In this section, eleven medical image fusion methods are selected for comparison with VANet. These eleven algorithms are GFF [19], NSCT [20], IGM [21], LPSR [22], WLS [23], CSR [24], LRD [25], TLAYER [26], CSMCA [27], LATLRR [28] and DTNP [29]. Among them, GFF, NSCT, IGM, LRD and TLAYER are traditional image fusion methods. WLS and CSMCA are deep learning fusion methods. LPSR is a fusion method based on sparse representation. CSR is a fusion method combining neural network and sparse representation. LATLRR is a fusion method based on low-rank decomposition. DTNP is a fusion method that combines dynamic threshold and wavelet transform. The source codes of all comparison algorithms come from the Internet and the parameters of each algorithm are set as recommended by the corresponding authors.
In order to evaluate the performance of VANet, we selected eight evaluation metrics to analyze the fused images of all algorithms. The eight metrics are Qw [30], Qe, SSIM [31], VIF [32], FMI [33], LABF [34], NABF [35] and NCIE [36]. Among them, Qw and Qe are derived from the Piella model. SSIM measures the structural similarity between the fused image and the source image. VIF (visual information fidelity) evaluates the visual quality of fused images. LABF, NABF, FMI and NCIE are representative information-theoretic metrics for evaluating image fusion.

Training details
The training of the VANet model involves many parameters, including batch_size, learning rate, epoch, and the balance parameter in the loss function. The settings of these parameters can have a profound effect on the fusion effect. Therefore, the analysis of these parameters has important research significance.
Batch_size
batch_size refers to the number of samples selected for one training step, and its size affects the optimization degree and speed of the model. Since the training set of the VANet model is relatively large, putting all the data into the network at one time would exhaust memory. Therefore, batch_size needs to be introduced to solve this problem. However, the value of batch_size cannot be too small. If it is too small, the learning will be random and the model will not converge. Considering the hardware environment and memory capacity of the experiment, and according to Leslie's theory, we set the value of batch_size to 64.

Epoch
Epoch is an important parameter that controls the number of weight update iterations and the weight update iteration directly affects the fit and convergence of the model. In the training of the VANet model, it is not enough to train all the data in one iteration to get the model into the best fit state. Therefore, it is necessary to set an appropriate epoch value to improve the stability of the model and the effect of image fusion. VIF is a metric that evaluates image quality from the perspective of information communication and sharing based on the statistical properties of natural scenes. Since the evaluation accuracy of this metric is related to the image itself and the distortion channel of the human visual system, it is very appropriate to choose it to assist in completing the determination of the value of epoch. Figure 6 shows the trend of VIF with the transformation of the epoch.
In Fig. 6, we give the average value of VIF for 50 pairs of medical fused images. When epoch is set to 40, the corresponding images average VIF value reaches the maximum and the fused image obtained is more in line with human visual perception. Therefore, we set the value of epoch to 40 to complete the training of the VANet model.

Learning rate
The learning rate is an important parameter of the VANet model, which affects the convergence of the model. If the learning rate is too large, the model will oscillate and not converge. If the learning rate is too small, the model will converge slowly. Based on the actual situation, we chose an exponentially decaying learning rate:

$$lr = lr_{base} \cdot lr_{decay}^{epoch} \tag{5}$$

where $lr_{base}$ is the initial value of the learning rate and $lr_{decay}$ is the decay rate of the learning rate. According to prior knowledge, the initial value of the learning rate is set to 0.1 and the decay rate is set to 0.99.

Hyperparameters
In the loss function of the VANet model, there are two hyperparameters α and β, which are used to adjust the SSIM loss function and the MSE loss function, respectively. With reference to how other scholars set hyperparameters for deep learning, the values of α and β are set between 0 and 0.01. Given the role of the two loss functions in the training process, we chose the evaluation metric VIF, which is related to human visual perception, to assist in determining the values of α and β. Figure 7 shows the trend of VIF with α and β.
In Fig. 7, we give the average value of VIF for 50 pairs of fused medical images. When α is set to 0.005 and β is set to 0.003, the average VIF value of the corresponding images reaches its maximum, which best meets the requirements of VANet model training.
Fig. 6 The changing trend of epoch
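The training settings reported in this section can be collected in a short sketch that also evaluates the exponential learning rate decay. The constants are the values stated in the text. The sketch assumes that epochs are counted from 0, that the learning rate is updated once per epoch and that the last, smaller batch of each epoch is kept; none of these details is given in the text.

```python
import math

NUM_PATCH_PAIRS = 11960        # training patch pairs
BATCH_SIZE = 64
EPOCHS = 40
LR_BASE, LR_DECAY = 0.1, 0.99  # initial learning rate and decay rate
ALPHA, BETA = 0.005, 0.003     # balance parameters of the loss function


def learning_rate(epoch):
    """Exponentially decayed learning rate: lr_base * lr_decay ** epoch."""
    return LR_BASE * LR_DECAY ** epoch


# Assumption: the final partial batch of each epoch is not dropped.
steps_per_epoch = math.ceil(NUM_PATCH_PAIRS / BATCH_SIZE)
schedule = [learning_rate(epoch) for epoch in range(EPOCHS)]

print(f"{steps_per_epoch} steps per epoch, {steps_per_epoch * EPOCHS} steps in total")
print(f"learning rate: first epoch {schedule[0]:.4f}, last epoch {schedule[-1]:.4f}")
```

With a decay rate of 0.99 the learning rate falls by about one third over the 40 epochs, so the decay is gentle rather than aggressive.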

Results
The test data are derived from the following five diseases, which are subacute stroke, hypertensive encephalopathy, cavernous hemangioma, metastatic bronchogenic carcinoma and mild Alzheimer's disease. Two pairs of the source images are selected for each disease to prove the effectiveness and superiority of our fusion model.

Subacute stroke: loss of sensation
The two sets of source images in this section are from a 65-year-old patient with subacute stroke. He is right-handed with mild left hemiplegia and atrial fibrillation. When he felt a tingling pain in his left arm, he went to the hospital and found that he could not explore the left half of the space. In his two sets of MR images, the cerebrospinal fluid left behind by the liquefaction and necrosis of the old infarct showed hyperintensity and successfully replaced the frontal pole. Hyperperfusion appears on the corresponding SPECT images. Figures 8 and 9 show the fusion results of all algorithms on the two sets of subacute stroke images. The fused image based on the CSR model almost loses the ability to describe functional information. The fused images obtained by the LRD, IGM, TLayers and DTNP algorithms cannot completely describe the blood flow level. The fused images obtained by the GFF and LPSR algorithms have serious distortion. The fused images obtained by NSCT, WLS and CSMCA are dark, which is not conducive to the description of the structural information of the image. The fused image obtained by the LATLRR algorithm has serious blurring. The fused image obtained by the VANet model can clearly describe the blood flow situation of the tissue, while retaining the key information in the MR images.
Fig. 7 The Hyperparameters change trend graph
Tables 1 and 2 give the objective performance of different algorithms on the fusion of the above two sets of medical images, respectively. The VANet model achieves optimal values on all objective evaluation metrics. From both subjective and objective perspectives, the subacute stroke images fused by VANet model can provide doctors with

Hypertensive encephalopathy
Two sets of source images in Figs. 10 and 11 are from a young woman who has acute arterial hypertension. In her MR-T2 images, bilateral temporal and occipital lesions can be clearly seen. Early perfusion abnormalities are obvious at higher levels in her SPECT-Tl image. In order to observe the lesion tissue and its perfusion better, the two sets of images are fused with the twelve algorithms and the fusion results are shown in Figs. 10 and 11, respectively. The fused images obtained by the NSCT, WLS and CSMCA algorithms have a dim brightness and lose the energy information in the SPECT image. The fused images obtained by the GFF, LPSR and CSR algorithms have serious distortion.
In Tables 3 and 4, it can be seen that the VANet model is outstanding on Qw, Qe, SSIM, LABF, NABF and NCIE. On VIF and FMI, the performance of VANet is lower than that of the LATLRR algorithm and the CSR model, respectively, which may be related to the feature extraction method. However, the fused images obtained by the LATLRR algorithm and the CSR model lack different color information, which makes them unable to provide reliable information for doctors. In contrast, the images fused by the VANet model retain more complete color information, which may be helpful for treating hypertensive encephalopathy.
Fig. 9 The second set of fused MRI-SPECT images from 9 methods on subacute stroke

Cavernous angioma
The experimental data are from a 26-year-old woman with a ten-year history of headaches. Recently, she received radiosurgery due to progressive weakness of the right arm and leg. Her MR images show obvious hemangiomas. Her SPECT image is marked with technetium. Among them are blood clots and scarred brains, surrounded by crystalline old blood products. The lesion does not fill with the marked red blood cells, indicating that it is not open to circulating blood. In order to better assist the doctor in the diagnosis and treatment of her disease, her two sets of registered images were chosen to be fused. Figures 12 and 13 show the fusion results of the two sets of images under different algorithms, respectively. The fused images obtained by the NSCT, WLS and CSMCA algorithms lack the low-frequency energy of the SPECT image, resulting in dim brightness. The fused image obtained by the LPSR algorithm is seriously distorted. The brightness of the fused images obtained by the IGM, LRD and DTNP algorithms is too high, which affects the description of the texture information. The fused images obtained by the GFF, CSR and LATLRR algorithms describe the blood circulation process poorly. The fused image obtained by the TLayers algorithm is relatively blurry and cannot describe the nuclide information. The fused image obtained by the VANet model is superior to those of the other algorithms in terms of brightness, contrast and description of nuclide information. Tables 5 and 6 show the objective performance of all algorithms on the above two sets of images, respectively. With the exception of VIF and Qe, VANet achieves the optimal value on all other metrics. Although the images obtained by the IGM algorithm and the DTNP algorithm achieve the optimal values on the VIF and Qe metrics, respectively, their poor fusion results seriously affect the doctor's observation of texture details.
In summary, the fused images obtained by the VANet model can help doctors observe and diagnose cavernous angioma better.

Metastatic bronchogenic carcinoma
The experimental data come from a 42-year-old woman who has been smoking for a long time. A sudden increase in headaches caused her to go to the hospital for a check-up. After examination, a large number of lumps appeared in her brain. The MR image demonstrates the tumor as an area of high signal intensity on proton density (PD) and T2-weighted (T2) images in a large left temporal region. The perfusion SPECT image shows very low blood flow to the lesion. In order to further combine tissue structure information and blood flow conditions to accelerate the diagnostic process, two sets of registered medical images are selected for fusion. Figures 14 and 15 show the fusion results of the two sets of images under different algorithms, respectively. The fused image obtained by the TLayers algorithm is blurred in texture detail. The fused images obtained by the NSCT and CSMCA algorithms have a dim brightness and lose the low-frequency energy in the SPECT image. The fused images obtained by the LPSR and LATLRR algorithms show color distortion. The fused images obtained by the GFF and CSR algorithms lose the ability to describe the blood flow levels of tissues. The brightness of
improvement over other algorithms, except the LPSR algorithm. Although the LPSR algorithm and the VANet model perform equally well on all metrics, the images obtained by the LPSR algorithm describe color information very poorly. In summary, the VANet model is more suitable for the image fusion of metastatic bronchogenic carcinoma, which can provide great help to doctors.

Mild Alzheimer's disease
Figures 16 and 17 show all the fusion results of the two sets of images, respectively. The fused images obtained by the NSCT and CSMCA algorithms are too dark and the energy information of the PET image is lost. The image obtained by the CSR algorithm loses almost all metabolic information. The fused image obtained by the GFF algorithm shows serious color distortion. In the fused images obtained by the IGM, DTNP and WLS algorithms, the brightness is too high, resulting in loss of information.
suboptimal values on the FMI metric and performs best on the remaining metrics. Combined with the fusion result, the medical images fused by the VANet model can provide great help to doctors in the process of treating mild Alzheimer's disease.

Ablation study
The core of the VANet model is the attention-multiscale fusion network. Within it, the attention mechanism branch fuses the global context of medical images and the residual multi-scale detail processing branch fuses the local context of medical images. In order to verify the influence of the two branches on the fusion results, this section removes one of the branches and uses the other branch for fusion. The experimental data are 60 groups of registered MRI and their corresponding nuclear medicine images, from which we randomly select the fusion results of three groups of images and show them in Fig. 18. First, in order to verify the influence of global context on the fused images, the attention mechanism branch is removed. The fusion results are shown in Fig. 18c, h and m. When the global context fusion is blocked, the fused image suffers from severe color distortion, resulting in a large deviation in the description of tissue metabolic information. Then, the residual multi-scale detail branch is removed to verify the effect of local context on the fused images. The fusion results are shown in Fig. 18d, i and n. It can be clearly seen that some detailed texture information is blurred, which affects the doctor's observation of key tissue information. Table 11 shows the statistical results of the objective metrics of the VANet model ablation experiment, with the optimal values shown in bold.
In Table 11, it can be seen that the performance of the VANet model with the attention branch removed is significantly weaker on most metrics, especially in SSIM, FMI,

Time complexity analysis
The images obtained by the VANet model have been subjectively analyzed and objectively evaluated above. This section evaluates the VANet model and the other algorithms from the perspective of time complexity. The time cost of each algorithm on each set of experimental images is shown in Tables 1 to 10. From these tables, it can be found that the LPSR algorithm takes the shortest time and the CSMCA algorithm takes the longest time. The time consumption of the LRD method is second only to that of the CSMCA algorithm. The time consumption of the CSR method and the DTNP algorithm also exceeds 10 seconds. The VANet model takes some time to train. After the model is trained, the time it takes to fuse images is comparable to that of the WLS algorithm. However, the fusion effect of VANet is much better than that of the WLS and LPSR algorithms.

Statistical test
When comparing algorithms, it is often necessary to perform statistical tests on experimental results. The Friedman test is a nonparametric test used to compare the performance of multiple algorithms on different datasets. However, the Friedman test can only detect whether there are differences between the performance of multiple algorithms. Once there is a difference, a post-hoc test is needed to find out which algorithms differ statistically in their performance. The Nemenyi test is a commonly used post-hoc test. It uses Tukey's distribution to calculate the critical difference (CD). If the rank difference of any two methods is larger than the value of CD, there is a significant difference between the two methods. In Fig. 19, the values of the objective evaluation indicators in Tables 1 to 10 are used to calculate the rank of each fusion algorithm. Combining the above two test
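The two-stage procedure described in this section can be sketched in plain Python. The sketch computes the average rank of each algorithm, the Friedman chi-square statistic and the Nemenyi critical difference CD = q_α · sqrt(k(k+1)/(6N)) for k algorithms and N datasets. The scores are toy values and not results from the tables, and q_α = 2.343 is the commonly tabulated critical value for k = 3 at α = 0.05; a comparison of twelve algorithms would need the value tabulated for k = 12.

```python
import math


def average_ranks(scores):
    """scores[i][j]: metric of algorithm j on dataset i (higher is better)."""
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # tied algorithms share the mean rank
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistic(ranks, n):
    """Friedman chi-square statistic from average ranks over n datasets."""
    k = len(ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(q_alpha, k, n):
    """Critical difference of the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Toy example: 3 algorithms evaluated on 4 datasets (illustrative values only).
toy_scores = [[0.90, 0.70, 0.50],
              [0.80, 0.60, 0.40],
              [0.95, 0.70, 0.60],
              [0.85, 0.80, 0.30]]
ranks = average_ranks(toy_scores)
chi2 = friedman_statistic(ranks, n=len(toy_scores))
cd = nemenyi_cd(q_alpha=2.343, k=3, n=len(toy_scores))
significant = [(a, b) for a in range(3) for b in range(a + 1, 3)
               if abs(ranks[a] - ranks[b]) > cd]
```

In this toy case only the first and third algorithms differ by more than the critical difference, so only that pair would be reported as significantly different.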

Conclusions
In this study, we propose a novel model for medical image fusion. To address the challenges faced by medical image fusion, the model first uses the five blocks of VGG-16 to build an encoder that obtains feature maps containing image context information. Second, the model constructs an AM fusion network with the attention mechanism as the core. The network builds blocks around the channel attention mechanism to enhance salient features and weaken redundant features. In order to retain more texture details, the network uses different convolution kernels to construct detail information patches and obtain multi-scale features of the image. Finally, all the acquired features are reconstructed by the decoder. The experimental results on the Harvard Medical School brain medical image dataset show that the fused images obtained by the VANet model are superior to those of current advanced fusion algorithms in terms of the expression of structural information and metabolic condition. Since the VANet model is not affected by the input order of the source images, it can be further extended to tri-modal medical image fusion.